Abstract

This  paper  provides  quantitative  measurements  of  load 
latency  tolerance  in  a  dynamically  scheduled  processor.  To 
determine  the  latency  tolerance  of  each  memory  load 
operation,  our  simulations  use  flexible  load  completion 
policies  instead  of  a  fixed  memory  hierarchy  that  dictates 
the latency. Although our policies delay load completion
as long as possible, they produce performance, measured in
instructions committed per cycle (IPC), comparable to an ideal
memory system where all loads complete in one cycle. Our
measurements  reveal  that  to  produce  IPC  values  within 
8%  of  the  ideal  memory  system,  between  1%  and  62%  of 
loads  need  to  be  satisfied  within  a  single  cycle  and  that  up 
to  84%  can  be  satisfied  in  as  many  as  32  cycles,  depending 
on  the  benchmark  and  processor  configuration.  Load 
latency  tolerance  is  largely  determined  by  whether  an 
unpredictable  branch  is  in  the  load’s  data  dependence 
graph  and  the  depth  of  the  dependence  graph.  Our  results 
also  show  that  up  to  36%  of  all  loads  miss  in  the  level  one 
cache  yet  have  latency  demands  lower  than  second  level 
cache  access  times.  We  also  show  that  up  to  37%  of  loads 
hit  in  the  level  one  cache  even  though  they  possess  enough 
latency  tolerance  to  be  satisfied  by  lower  levels  of  the 
memory  hierarchy. 

1  Introduction 

Many of today's microprocessors use dynamic scheduling
[17, 18] to maximize the number of instructions issued
per cycle. By buffering instructions that are waiting for
their operands and executing other independent instructions
out of order, the processor is able to tolerate some
long-latency operations, including cache misses. To find
enough independent instructions, most processors employ
sophisticated branch prediction mechanisms [13, 21] and
allow speculative execution [4, 9, 19], committing results
only when the true outcome of a branch is known.

Unfortunately, because of finite resources, data dependencies,
and imperfect branch prediction, some operations
must complete quickly to maximize processor performance.
Consider a processor capable of issuing 8 instructions
per cycle with a 10-cycle level-two cache access
latency. In the time it takes the level-two cache to satisfy a
load request, the processor could issue up to 80 instructions.
If many of these instructions are dependent on the
value returned by the load and there is insufficient buffer
space because of previous long-latency operations, the
10-cycle load operation would cause the processor to stall. In
contrast, if many of the instructions are independent and
there is sufficient buffer space, their execution could overlap
with the load and not stall the processor. Load latency
tolerance exists in the latter case, when a memory reference
can take many cycles to complete without adversely
affecting performance.
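
The arithmetic behind the 80-instruction figure can be written out as a minimal sketch. The function name is hypothetical and the product is only an upper bound, since it assumes the processor issues at full width in every cycle of the load.

```python
def issue_slots_during_load(issue_width: int, load_latency_cycles: int) -> int:
    """Upper bound on instructions issued while one load is outstanding.

    Assumes the processor issues at full width every cycle, which is the
    best case described in the text, not a measured value.
    """
    return issue_width * load_latency_cycles


# The example from the text: 8-issue processor, 10-cycle level-two access.
slots = issue_slots_during_load(8, 10)
print(slots)
```

Whether those slots are actually filled depends on how many of the instructions are independent of the load and on available buffer space, which is what the rest of the paper measures.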

The first contribution of this paper is to present a quantitative
evaluation of load latency tolerance in a dynamically
scheduled processor. Using SimpleScalar [2] we
measure the latency tolerance of individual load instructions by
forcing their completion such that the number of instructions
committed per cycle (IPC) is comparable to an ideal
memory system that satisfies all requests in a single cycle.
We evaluate a variety of policies to force load completion
in an effort to balance high IPC values with long load
latencies. We find that using mispredicted branches and
the depth of a load's dependence graph to determine when
loads should complete produces IPC values within 8% of
the ideal memory system, while yielding noticeable
latency tolerance.

Our measurements, on an 8-issue processor that can
have up to 256 instructions in flight, show that between
13% and 62% of the loads in our benchmarks need to
complete in one cycle and that 58% to 98% must complete
in 8 cycles. Reducing the issue width to 4 reduces the
number of one-cycle loads to between 1% and 46% and the
number of 8-cycle loads to between 5% and 88%. These results show
that many loads could be satisfied with latencies comparable
to second-level cache hit times.

The second contribution of this paper is an evaluation
of the match between the latencies incurred in a traditional
memory hierarchy and an application's inherent latency
demands. We find that between 2% and 36% of loads
requiring latency below 8 cycles miss in the first-level
cache, depending on the cache size and application.
Furthermore, our results reveal that between 2% and 37% of
loads that hit in the first-level cache have enough latency
tolerance that they could be satisfied by lower levels of the
memory hierarchy.

The remainder of this paper is organized as follows.
Section 2 provides background information. Section 3
describes our technique for measuring the available load
latency tolerance. Our experimental methodology is presented
in Section 4, and Section 5 presents our results.
Section 6 concludes this paper.

2  Background 

Superscalar processors maximize serial program performance
by issuing multiple instructions per cycle. Each
cycle the processor attempts to issue up to issue-width
instructions. One of the most important aspects of these
systems is identifying independent instructions that can
execute in parallel.

2.1  Scheduling,  Prediction,  and  Speculation 

In order to identify and exploit instruction-level parallelism,
most of today's processors employ dynamic scheduling,
branch prediction, and speculative execution.
Dynamic scheduling is an all-hardware technique for identifying
and issuing multiple independent instructions in a
single cycle. The hardware looks ahead by fetching
instructions into a buffer, called an issue window, from
which it selects instructions to issue to the functional units.
Instructions are issued only when all their operands are
available, and independent instructions can execute out of
order. Results of instructions executed out of order are
committed to the register file in program order.

The issue window is filled with instructions from several
basic blocks by predicting the direction of conditional
branches [3, 6, 10, 12, 14]. Furthermore, the instructions
from the predicted path are speculatively executed, in the hope
that the branch prediction is correct. These instructions cannot
update the architectural state of the processor until the true
outcome of the branch is determined. While waiting for
the branch computation, these instructions occupy valuable
buffer space, potentially reducing the issue rate.

The above techniques are very effective for well-behaved
programs with short-latency operations. However,
long-latency operations, such as load cache misses,
can reduce their effectiveness for the following reasons:


1. Data Dependencies: If the operands of instruction i
are produced by a previous instruction j, then i is
dependent on j, and i can begin execution only after j
has completed. Clearly, instruction i cannot be issued
in the same cycle as j. While looking ahead, if most of
the newly dispatched instructions are dependent on
earlier issued instructions waiting to complete, even a
dynamically scheduled processor will be unable to
identify ready instructions to execute, and processor
utilization will go down.

2. Finite Resources: The number of instructions the processor
can look ahead to identify independent instructions
is limited by the number of available entries in
the issue window. Dependent instructions waiting for
the result of a load and independent instructions waiting
to commit their results both occupy entries. For
long-latency operations, the window could become full and
stall the processor.

3. Branch Misprediction: When the predicted outcome
of a branch is incorrect, the work performed by the
processor while executing along the incorrect path is
useless (see footnote 1). If computation of the true branch
outcome is delayed because of a cache miss, the processor could
waste many cycles before the misprediction is
detected.

The remainder of this paper examines load latency tolerance
in the context of the above potential limitations.

3  Measuring  Load  Latency  Tolerance 

This section presents the policies we use to determine
the latency tolerance of load instructions. Our primary
goal is to quantify the amount of latency tolerance in
dynamically scheduled processors. We also want to determine
whether latency tolerance varies among individual loads.
That is, do some loads require fast servicing
while others can be satisfied in longer amounts of time
without degrading performance? Finally, we want to evaluate
the match between a load's latency tolerance and the
latency it incurs in a conventional cache memory hierarchy.

We compute the latency tolerance of a load by measuring
the number of cycles that elapse between the time the
load is issued and the time it completes. A load is issued to
the memory system when its effective address is available,
and it completes when the referenced data is available for
use by dependent instructions. In a traditional memory
system, a load's latency depends primarily on which level
in the memory hierarchy satisfies the request.


Footnote 1: Recent techniques show how to reuse some of the results
obtained during speculative execution [15].


Our goal is to determine how long a load can be outstanding
without causing degradation in performance.
Hence, we do not complete a load as long as the processor
is able to do useful work by looking ahead and executing
independent instructions. In particular, we want our measured
latencies to reflect a program execution that achieves
IPC close to that of an ideal memory system, where all
references complete in one cycle.

Previous studies that examined latency tolerance in
decoupled architectures [8] analyzed the effects of
increasing memory latency for systems both with and
without caches. In the systems without caches, this produces
a uniform increase in latency for all memory
accesses. This type of analysis can provide some insight
into latency tolerance. However, it is an all-or-nothing
approach where every load has the same cost, and it is suitable
only for specific memory system designs. Similarly,
increasing the memory latency in cache-based systems
does not accurately measure individual load latency tolerance,
and the results may be highly dependent on the
cache organization. The latency tolerance of references
that hit in the cache is not measured, even though it may
be quite large.

Our methodology extends this previous work and is
targeted at measuring the latency tolerance of individual
load instructions. We rely on the ability to force completion
of loads at arbitrary times to ensure the processor is
able to continue issuing instructions. Note that this approach
measures latency tolerance in the context of a processor
with constrained resources. Eliminating these constraints
would be an interesting study, but is beyond the scope of
this paper.

In our scheme, the measurement of load latency is
decomposed into the following four steps, which we elaborate
on in the remainder of this section:

1. Determining that one or more loads should complete,

2. Determining when the load(s) should complete,

3. Determining which specific load(s) should complete,
and

4. Determining how many loads should complete.

3.1  Determining  that  Loads  Should  Complete 

Our goal is to allow loads to remain outstanding as long
as they are not adversely affecting performance. Therefore,
to determine if any load(s) should be forced to complete
we must first determine if the processor performance
is degrading. Recall that the performance of a dynamically
scheduled processor can degrade because it is unable to
execute independent instructions due to limited buffer
space or data dependencies, or because it executes useless
instructions due to incorrect branch prediction. Therefore, we
force loads to complete if their results ensure the processor
executes useful instructions or enable the processor to sustain
reasonable execution rates.

Branch-based  Load  Completion 

Most modern processors predict the outcome of
branches and speculatively execute instructions on the predicted
path. On a misprediction, all the work done by the
processor in speculative mode is useless. Delaying completion
of a load on which a branch instruction is dependent
can increase the number of mis-speculated
instructions executed and therefore degrade performance.
Hence, loads on which branches are dependent need to be
given priority for early completion. Moreover, it is only
mispredicted branches that cause the processor to execute
useless instructions. Therefore, it is sufficient to force a
load to complete as soon as a mispredicted branch attaches
itself to the load's dependency graph.
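
The branch-based rule can be illustrated with a minimal sketch. It is not the simulator's code: the dictionary-based dependence graph, the instruction tags, and the function names are assumptions made for illustration. The policy names mirror the "all", "mispred", and "none" variants evaluated in Section 5.2.

```python
def loads_feeding(branch, producers, outstanding_loads):
    """Outstanding loads the branch depends on, directly or transitively.

    producers maps an instruction tag to the tags producing its operands
    (a hypothetical representation of the dependence graph).
    """
    found, seen, stack = set(), set(), [branch]
    while stack:
        node = stack.pop()
        for producer in producers.get(node, ()):
            if producer in seen:
                continue
            seen.add(producer)
            if producer in outstanding_loads:
                found.add(producer)      # this load is what the branch waits on
            else:
                stack.append(producer)   # keep walking up the dependence chain
    return found


def loads_to_force(branch, mispredicted, producers, outstanding_loads,
                   policy="mispred"):
    """Loads to force when a branch attaches to a dependence graph."""
    if policy not in ("all", "mispred", "none"):
        raise ValueError(f"unknown policy: {policy}")
    if policy == "none" or (policy == "mispred" and not mispredicted):
        return set()
    return loads_feeding(branch, producers, outstanding_loads)
```

For example, with `producers = {"br": ["cmp"], "cmp": ["ld1", "add"], "add": ["ld2"]}` and only `ld1` still outstanding, a mispredicted `br` forces `ld1` alone.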

Performance-based  Load  Completion 

Using branch prediction information to force completion
of certain loads ensures the processor is executing
useful instructions. However, that alone is not enough.
Arbitrarily delaying completion of the rest of the loads
will aggravate the data dependence problem and the
finite resources problem described in Section 2.1.
This could prevent the processor from sustaining a
reasonable level of performance.

To decide if loads should complete because of processor
performance, we can monitor one of two standard processor
performance metrics: instruction issue rate or
functional unit utilization. When the processor performance
drops, we complete loads, freeing up dependent
instructions as well as buffer space. In order to attain high
IPCs we do not delay load completion until the processor
actually comes to a standstill. Rather, we complete loads
as soon as the number of instructions issued or the number
of computational units that are busy drops below a tunable
threshold.
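
As a minimal sketch of this trigger, the check below compares one of the two metrics against the tunable threshold. The function name, the metric labels, and the per-cycle counters passed in are hypothetical; the text does not describe how the simulator samples them.

```python
def performance_is_degrading(issued: int, busy_units: int,
                             metric: str, threshold: int) -> bool:
    """True when the monitored metric has dropped below the threshold.

    issued     -- instructions issued in the current cycle
    busy_units -- functional units busy in the current cycle
    metric     -- "issue_rate" or "fu_utilization" (labels are assumptions)
    """
    if metric == "issue_rate":
        return issued < threshold
    if metric == "fu_utilization":
        return busy_units < threshold
    raise ValueError(f"unknown metric: {metric}")
```

With an issue threshold of four, a cycle that issues three instructions triggers load completion, while a cycle that issues four does not.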

Loads can also be forced to complete when there is a
system call. However, our benchmarks make very
few system calls, so we do not discuss this case further
in this paper.

3.2  Determining  Which  Loads  to  Complete 

Once it is determined that some load(s) must complete,
we need to decide which specific load to complete. In the
case of mispredicted branches, clearly the load on which
the branch is dependent must be completed to ensure the
execution of useful instructions. In contrast, we have complete
freedom to choose any load for completion when the
issue rate or functional unit utilization decreases. We
investigate two policies: fifo and dependence graph depth.
The fifo policy simply forces the longest outstanding load
to complete. The second policy tracks the depth of a load's
dependence graph in cycles. The load with the largest
value is chosen for completion, since delaying it can
occupy resources for an extensive period of time.
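
The two selection policies can be sketched as follows. The `OutstandingLoad` record and its field names are hypothetical, and tie-breaking (the first candidate wins) is an assumption the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class OutstandingLoad:
    tag: str
    issue_cycle: int   # cycle the load was issued to the memory system
    graph_depth: int   # depth of the load's dependence graph, in cycles


def select_load(loads, policy):
    """Pick the load to force under the fifo or depth policy."""
    if not loads:
        return None
    if policy == "fifo":
        # Longest outstanding load, i.e. the earliest issue cycle.
        return min(loads, key=lambda load: load.issue_cycle)
    if policy == "depth":
        # Load with the deepest dependence graph.
        return max(loads, key=lambda load: load.graph_depth)
    raise ValueError(f"unknown policy: {policy}")
```

The two policies can disagree: an old load with a shallow dependence graph is chosen by fifo, while a newer load feeding a long chain is chosen by depth.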

3.3  Determining  When  Loads  Should  Complete

Having established which loads to complete, the next
step is to determine when (i.e., in which cycle) they must
complete. To minimize execution of useless instructions
due to mispredicted branches, we must complete the
appropriate load such that the entire dependence chain
between the load and the branch completes execution
before the branch. This requires the load to complete many
cycles before we actually detect the mispredicted branch.
Section 4 describes how our simulations accomplish this.
If the load is forced to complete because of issue rate or
functional unit utilization, we could naively complete it in
the same cycle that we detected the degradation in performance.
However, this may not provide enough time for the
pipeline to fill up with ready instructions, and we may
want the load to complete earlier. Therefore, we use a tunable
threshold for load precompletion time to study the
effect of pipeline fill-up time on load latency tolerance.

3.4  Determining  How  Many  Loads  to  Complete 

Finally, in order to obtain an instruction issue rate or
functional unit utilization above the set threshold, we may
need to complete more than one load in a given cycle. An
important parameter in this scenario is the limit on the
number of loads that may complete. We study this by limiting
the number of loads that can complete in a single
cycle to one, two, or four.
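
One way to picture the per-cycle limit is the loop below, which completes loads in priority order until either the issue threshold is met or the limit is reached. This is a sketch under a strong simplifying assumption: `ready_gain`, the number of extra instructions each load is expected to make issuable, is a hypothetical input that the paper does not define.

```python
def complete_until_threshold(ranked_loads, ready_gain, issued, threshold, limit):
    """Choose loads to complete this cycle.

    ranked_loads -- load tags, highest priority first
    ready_gain   -- tag -> instructions expected to become issuable
                    (illustrative model, not taken from the paper)
    issued       -- instructions issued so far this cycle
    threshold    -- issue-rate threshold to reach
    limit        -- maximum loads completed per cycle (one, two, or four)
    """
    chosen = []
    for tag in ranked_loads:
        if issued >= threshold or len(chosen) >= limit:
            break
        chosen.append(tag)
        issued += ready_gain[tag]
    return chosen
```

With a limit of two, a cycle that needs three more ready instructions may still fall short of the threshold, which is the effect the limit parameter is meant to expose.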

4  Experimental  Methodology 

To perform our measurements we modified SimpleScalar
[2], which models a dynamically scheduled processor
using a Register Update Unit (RUU) and a Load/Store
Queue (LSQ) [16]. The processor pipeline stages are:

Fetch: Fetch instructions from the program instruction
stream.

Dispatch: Decode instructions and allocate RUU and LSQ
entries.

Issue/Execute: Execute ready instructions if the required
functional units are available.

Writeback: Supply the results of the operation to dependent
instructions.

Commit: Commit results to the register file in program
order and free RUU and LSQ entries.

Our baseline processor is an 8-issue machine with 8
integer adders, 4 integer multiply/divide units, 8 floating
point adders, 4 floating point multiply/divide units, and 8
cache ports. We assume 256 RUU entries and 128 LSQ
entries, a 2-level branch predictor with a total of 8192
entries, and that all stores complete in a single cycle.

When necessary we assume a base two-level cache configuration
using a 32KB direct-mapped L1, with 32 byte
blocks and 8 ports. The L2 is 1MB direct-mapped with 64
byte blocks, a single port, and 8 cycles to satisfy an L1
miss. Both caches support up to 16 outstanding misses, are
fetch-on-write writeback, and have a 24 entry write-back
buffer with a high watermark of 12. Contention is modeled
in all parts of the memory system.

Many of the load completion policies outlined in the
previous section decouple detecting that a load must complete
from determining when the load should complete.
Therefore, it is possible for our scheme to determine that a
load should have completed even before we detect that it
should complete. Recall the scenario where we detect that
a load should complete when a branch is dispatched and
attaches itself to the dependence graph of the load.
Assume this detection occurs at cycle t and there are d
cycles worth of instructions in the dependence chain from
the load to the branch. To minimize execution of useless
instructions, we determine that the load should complete at
cycle (t - d), d cycles before we even establish that the load
should complete.
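
The (t - d) computation can be sketched as below. Treating d as the sum of per-instruction latencies along a single chain from the load to the branch is an assumption made for illustration; the function and parameter names are hypothetical.

```python
def branch_forced_completion(detect_cycle, chain_latencies):
    """Return (target_cycle, d) for a load feeding a mispredicted branch.

    detect_cycle    -- cycle t at which the branch attaches to the load's
                       dependence graph
    chain_latencies -- execution latencies, in cycles, of the instructions
                       on the chain from the load to the branch
                       (single-chain model, an illustrative assumption)
    """
    d = sum(chain_latencies)
    return detect_cycle - d, d


# Detection at cycle 100 with a 1 + 3 + 1 cycle chain: the load should have
# completed at cycle 95, five cycles before the need was detected.
target, d = branch_forced_completion(100, [1, 3, 1])
```

Because the target cycle lies in the past whenever d is positive, the simulator has to revisit earlier cycles, which motivates the rollback support described next.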

To support this type of analysis, we added rollback
capabilities to our simulator. This allows us to look ahead
to compute load completion time, roll back the processor,
and then restart execution using the predetermined load
latency. The replayed execution may itself incur rollbacks.
This technique ensures the processor instruction schedule
is determined by the measured latency values. Supporting
rollback requires logging all processor state at the end of
each simulated cycle. We limit the maximum number of
cycles a load can be outstanding to 32, and therefore a single
load can cause the processor to roll back a maximum of
32 cycles.
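
A bounded per-cycle state log of the kind this implies might look like the sketch below. It is an illustrative assumption, not SimpleScalar's implementation: the class name, the use of deep copies, and the dictionary used as processor state are all hypothetical. Only the 32-cycle bound comes from the text.

```python
import copy
from collections import deque

MAX_OUTSTANDING_CYCLES = 32  # a load is never outstanding longer than this


class RollbackLog:
    """Keeps end-of-cycle snapshots for the most recent cycles."""

    def __init__(self, depth=MAX_OUTSTANDING_CYCLES):
        # depth + 1 entries: the current cycle plus `depth` earlier ones.
        self._snapshots = deque(maxlen=depth + 1)

    def record(self, cycle, state):
        """Log the state at the end of `cycle` (cycles must be consecutive)."""
        self._snapshots.append((cycle, copy.deepcopy(state)))

    def rollback(self, cycle):
        """Discard snapshots newer than `cycle` and return that cycle's state."""
        if not self._snapshots or not (
            self._snapshots[0][0] <= cycle <= self._snapshots[-1][0]
        ):
            raise LookupError("cycle is outside the logged window")
        while self._snapshots[-1][0] > cycle:
            self._snapshots.pop()
        return copy.deepcopy(self._snapshots[-1][1])
```

After a rollback, execution restarts from the returned state and logging continues from that cycle, so a replayed execution can itself be rolled back.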

Simulating a detailed out-of-order processor takes an
enormous amount of time, and the rollback capabilities we
added only increase simulation time. Therefore, we also
modified SimpleScalar to support sampling. Our sampling
technique alternates between a detailed out-of-order simulator
and a faster functional simulator that also maintains
the contents of the memory hierarchy.
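
The alternation between the two simulators can be pictured as a schedule of instruction counts. The sketch below is an assumption about how such a schedule could be generated: the sampling period is a hypothetical parameter, since the text gives only the overall sampling ratio (1% in Section 4), and the period is assumed to be positive.

```python
def sampling_schedule(total_instructions, period, detailed_percent):
    """Yield (mode, instruction_count) pairs alternating between modes.

    period           -- instructions per sampling period (hypothetical, > 0)
    detailed_percent -- share of each period run in the detailed simulator
    """
    detailed = period * detailed_percent // 100
    done = 0
    while done < total_instructions:
        n = min(detailed, total_instructions - done)
        if n:
            yield ("detailed", n)
            done += n
        n = min(period - detailed, total_instructions - done)
        if n:
            yield ("functional", n)
            done += n
```

In the functional phases only architectural state and cache contents advance, so the detailed phases start with warm caches.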

Finally, to evaluate the effectiveness of traditional
memory hierarchies at capturing latency tolerance, we
simulate a two-level memory hierarchy, as described
above, in the same execution as the latency tolerance analysis.
This enables comparison between the measured
latency tolerance and where in the conventional memory
hierarchy the request is satisfied. The load timing is dictated
by the latency tolerance measurements, and we simply
track the contents of the memory hierarchy. Because of
the different processor schedule, there may be some inaccuracies
in the contents of the caches compared to an execution
with load latency dictated by the conventional
caches. However, we believe our approach is sufficient for
this study.
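
The comparison this enables can be sketched as a classification of each load by its measured tolerance and its L1 outcome. The category names and the exact boundary (tolerance of at least the 8-cycle L2 access time counts as tolerant) are assumptions made for illustration.

```python
from collections import Counter

L2_LATENCY = 8  # cycles to satisfy an L1 miss in the base configuration


def classify(tolerance_cycles, l1_hit, l2_latency=L2_LATENCY):
    """Compare a load's measured tolerance with where the cache served it."""
    tolerant = tolerance_cycles >= l2_latency   # boundary is an assumption
    if l1_hit and tolerant:
        return "hit_but_tolerant"     # could have been served by a lower level
    if not l1_hit and not tolerant:
        return "miss_but_urgent"      # needed faster service than L2 provides
    return "matched"


def tally(loads):
    """Fractions per category for (tolerance_cycles, l1_hit) pairs."""
    counts = Counter(classify(t, hit) for t, hit in loads)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}
```

The two mismatch categories correspond to the two percentages reported in the introduction: urgent loads that miss in the first-level cache, and tolerant loads that hit in it.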

The following section presents our analysis using a
subset of the SPEC95 benchmarks: compress, gcc, li,
vortex, hydro2d, swim, tomcatv, and wave. The
benchmarks are all compiled using the version of gcc
provided with SimpleScalar and with optimization -O2.
We run each benchmark operating on its reference data set
until 10 billion instructions commit using 1% sampling.
This sampling ratio produces IPC values within 5% of
complete simulations.

5  Experimental  Results 

This section presents our simulation results. We begin
by examining the performance of our benchmarks for different
memory systems. This is followed by analysis of the
effects of branch prediction on latency tolerance. We then
analyze the effects of varying how to determine that loads
should complete, how to select loads to complete, how to
compute the time that loads should complete, and how
many loads are completed. We finish by examining various
processor configurations and investigating the match
between the latencies incurred in conventional multi-level
memory hierarchies and the program's measured latency
tolerance.

5.1  Fixed  Latency  Memory  Systems 

One approach to obtain information on the amount of
latency tolerance in a system is to evaluate its performance
for various memory system delays. We performed this
experiment by examining memory systems ranging from
simple fixed cost memory accesses with no contention to
detailed memory hierarchies with contention accurately
modeled at all levels. The fixed cost memory systems
assume all loads take the same amount of time; we examined
1, 8, and 32 cycle memory accesses. The detailed
two-level memory hierarchies assume the base 1MB second-level
cache, but vary the first-level configuration and
the second-level miss penalty. Specifically, we simulated a
direct-mapped and a two-way set-associative 32KB L1
cache with a 32 cycle memory latency
(memlat32-{dm,2way}32k) and a direct-mapped 32KB L1 with a 64
cycle memory latency (memlat64-dm32k).

Figure 1 shows the performance of our benchmarks in
terms of committed instructions per cycle (IPC) for the
above memory system configurations. We make several
observations from these results. First, for three of the integer
benchmarks (gcc, li, and vortex) the traditional
memory systems achieve IPC values close to the ideal
memory system. This is not surprising, given the low miss
ratios of these benchmarks: 2%, 1.4%, and 1.5% for gcc,
li, and vortex, respectively, for a direct-mapped 32KB
L1 cache. The other benchmarks exhibit L1 miss ratios
over 4%, thus increasing the discrepancy in performance
compared to the ideal memory system. We note that
increased associativity has little effect on overall IPC, and
that increasing the L2 miss penalty (memlat64-dm32k)
dramatically reduces the performance of three floating
point benchmarks (hydro2d, swim, tomcatv), while
all other benchmarks exhibit a small reduction in IPC.

Another observation from the data in Figure 1 is that
the performance of all benchmarks decreases as we
increase the latency for all memory accesses (ideal, fixed 8
cycles, fixed 32 cycles). The integer programs are especially
sensitive to the increases in fixed cost memory
delays, and their IPC values drop below those of the traditional
memory system when all memory accesses take 8 cycles.
In contrast, the floating point codes show less sensitivity to
a fixed cost delay of 8 cycles. Further increases in memory
latency continue to decrease the IPC for all programs.
However, we point out the results of swim, which show only
a moderate reduction in IPC even when all memory accesses
take 32 cycles. This performance is dramatically higher
than that of the detailed two-level memory hierarchy, mainly
because of contention within the memory hierarchy.

The above analysis of fixed cost memory accesses provides
some insight into latency tolerance. However, it is an
all-or-nothing approach where every load has the same
cost, and it is suitable only for specific memory system
designs. The uniform cost model does not exist in multi-level
memory hierarchies where some loads can be satisfied
faster than others. Therefore, as described in
Section 3, our methodology is targeted at measuring the
latency tolerance of individual load instructions. The
remainder of this section presents our results.

5.2  Determining  that  Loads  Must  Complete 

This section investigates the policies for determining
when loads must complete in order to sustain performance
comparable to an ideal memory system. We begin by
examining how branch prediction affects load latency tolerance.
This is followed by analysis of instruction issue
rate and functional unit utilization as metrics for determining
load completion.

Branch  Prediction  and  Load  Latency  Tolerance 

We compare perfect branch prediction to our base two-level
predictor using various policies for determining that
loads should complete. For the two-level branch predictor,
the first policy always forces loads to complete if any
branch attaches itself to the load's dependence chain
(2-lev, all). The next policy is similar, except it forces a load
to complete only if the branch is mispredicted (2-lev,
mispred; see footnote 1). Finally, for both the two-level and perfect
branch predictor we evaluate a policy that does not use any
branch information to force completion of loads (2-lev,
none and perfect, none). In all of these simulations we
make the following assumptions: loads not forced to complete
by a branch are completed according to an instruction
issue threshold of four instructions per cycle, up to
four loads can complete per cycle, which load to complete
is determined by the dependence graph depth, and the
precompletion time is two cycles. We evaluate
these parameters later in this section. For comparison, we
also simulate the ideal memory system for both the two-level
predictor (2-lev ideal) and perfect prediction (perfect
ideal).

Throughout this section we present our results in two
parts: IPC and latency tolerance. IPC results are presented
like those in Figure 1. We present latency tolerance in
terms of the fraction of loads that must complete in a specific
number of cycles. Loads that must complete in a
small number of cycles do not exhibit latency tolerance,
and loads that can take many cycles to complete do exhibit
tolerance. Loads are forced to complete according to the
appropriate policy, or if they have been outstanding for 32
cycles.
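
The latency tolerance metric described here is a cumulative fraction over per-load latencies. A minimal sketch, with hypothetical function names and made-up sample values rather than measured data:

```python
MAX_OUTSTANDING_CYCLES = 32  # loads are forced to complete after this long


def measured_latency(issue_cycle, complete_cycle):
    """Cycles between issue and completion, capped at the 32-cycle limit."""
    return min(complete_cycle - issue_cycle, MAX_OUTSTANDING_CYCLES)


def fraction_within(latencies, cycles):
    """Fraction of loads that had to complete within `cycles` cycles."""
    if not latencies:
        return 0.0
    return sum(1 for latency in latencies if latency <= cycles) / len(latencies)


# Illustrative values only, not results from the paper.
sample = [1, 1, 4, 8, 32]
one_cycle = fraction_within(sample, 1)
eight_cycles = fraction_within(sample, 8)
```

A low fraction at one cycle and a fraction that keeps rising out to 32 cycles indicates a program with substantial latency tolerance.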

Figure 2 shows the effects of branch prediction on IPC,
while Figure 3 shows the corresponding latency tolerance
values. From Figure 2, we see that the 2-level predictor
policies that exploit branch information to force load completion
meet our goal of IPC close to an ideal memory
system. Furthermore, we see that using only mispredicted
branches produces IPC values similar to those obtained by forcing loads to
complete for all branches, and that ignoring branch information
entirely dramatically reduces the integer programs'
IPC values. We also see the expected result that perfect
branch prediction dramatically increases performance for
the integer codes. The two-level predictor achieves only
88% accuracy for compress, 81% for gcc, 86% for li,
and 89% for vortex. The floating point codes exhibit
somewhat higher prediction rates (hydro2d 99%, swim
99%, tomcatv 92%, wave 90%). More sophisticated
branch predictors may produce higher accuracies and hence
higher IPC values.

Footnote 1: This is possible to simulate because in SimpleScalar we can
determine very early in the simulation cycle if a branch is mispredicted.

Figure 2: Branch Prediction and IPC.

Figure 3: Branch Prediction and Latency Tolerance.

For the integer benchmarks compress, gcc and li,
between 20% and 25% of the loads are completed based
on mispredicted branch information, compared with 7% for
vortex and less than 2% for the floating point benchmarks.
With load completion based on mispredicted branches
enabled, we see a considerable reduction in the average
dispatch-issue delay for branches (the time for the operands
of the branch to become available) for the integer
benchmarks. Also, the number of speculative instructions
executed drops by up to 61% for the integer benchmarks.
The effect on the floating point benchmarks is considerably
smaller.

From  Figure  3  we  see  that  loads  do  exhibit  variation  in 
completion  delays,  and  there  is  significant  variation 
among  benchmarks.  Between  13%  and  62%  of  loads  need 
to  complete  in  one  cycle,  while  58%  to  98%  of  the  loads 
need  to  complete  within  eight  cycles.  Furthermore,  the 
floating  point  programs  exhibit  very  little  variation  in  load 
latency  for  the  various  branch-based  load  completion 
schemes.  In  contrast,  the  integer  programs  are  sensitive  to 
these  factors.  In  particular,  differentiating  mispredicted 
branches  from  accurately  predicted  branches  produces 
noticeable  improvements  in  load  latency  tolerance  without 
significant  changes  in  IPC.  Ignoring  branches  altogether 
yields  high  latency  tolerance,  but  the  IPC  values  are  too 
low.  The  final  observation  from  these  results  is  that 
improvements  in  branch  prediction  will  increase  the 
amount  of  latency  tolerance  for  the  integer  programs,  as 
indicated by the increases seen for perfect branch
prediction.
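
The branch-based policies hinge on whether a mispredicted branch depends, directly or transitively, on an outstanding load. A minimal sketch of that check as a breadth-first reachability test follows. The graph representation (`dependents` mapping an instruction to its consumers) and all instruction names are hypothetical, chosen only for illustration.

```python
from collections import deque


def reaches_mispredicted_branch(load, dependents, mispredicted):
    """Return True if a mispredicted branch transitively depends on `load`.

    dependents:   dict mapping an instruction id to the ids of the
                  instructions that consume its result (assumed format).
    mispredicted: set of branch ids known to be mispredicted.
    """
    seen = {load}
    queue = deque([load])
    while queue:
        inst = queue.popleft()
        for consumer in dependents.get(inst, ()):
            if consumer in seen:
                continue
            if consumer in mispredicted:
                return True
            seen.add(consumer)
            queue.append(consumer)
    return False


if __name__ == "__main__":
    graph = {"ld1": ["add1"], "add1": ["br1"], "ld2": ["mul1"]}
    bad_branches = {"br1"}
    print(reaches_mispredicted_branch("ld1", graph, bad_branches))  # True
    print(reaches_mispredicted_branch("ld2", graph, bad_branches))  # False
```

Under this sketch, a load such as `ld1` would be forced to complete so that the dependent branch resolves early, while `ld2` could remain outstanding.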

Processor  Performance  and  Load  Latency  Tolerance 

The  second  source  of  information  for  determining  if 
loads  should  complete  is  processor  performance.  Here,  we 
examine the instruction issue rate and functional unit
utilization as metrics for determining that loads should
complete. We assume that mispredicted branches force
completion  of  loads,  precompletion  time  is  2  cycles,  and 
up  to  four  loads  can  complete  per  cycle. 

Figure 4 shows the effect of issue rate thresholds of 1
(nis1), 2 (nis2) and 4 (nis4), and a functional unit
utilization threshold of 4 (fub4) on instructions per cycle.
Whenever the processor issue rate (or number of busy
functional units in the case of fub4) drops below this
threshold, we force loads to complete. For comparison, we
include the IPC values for the traditional memory system
(memlat32-dm32k) and the fixed 8 cycle memory system.
From this data we see that functional unit utilization
produces slightly higher IPC values than instruction issue
rate (fub4 vs. nis4). The simulations also reveal that
decreasing the instruction issue rate threshold produces a
commensurate decrease in IPC. We note that for all but
three of the integer benchmarks, IPC values are still higher
than the traditional two-level memory system even when the
threshold is one instruction per cycle. As mentioned
previously, the three integer programs have low L1 miss
rates, and they achieve near ideal performance. We also
note that for swim, functional unit utilization actually
achieves higher IPC than the ideal memory system because
of the difference in the processor instruction issue
schedule.

Figure 5: Performance-based completion and Latency
Tolerance

Figure 5 shows the corresponding latency tolerance for
the various issue rate thresholds. The first observation is
that although functional unit utilization (fub4) has a slight
performance advantage, it produces much lower latency
tolerance than the instruction issue rate metric (nis4). We
also observe that decreasing the issue rate threshold can
dramatically increase the latency tolerance. This matches
our intuition that if the processor is consuming data at a
lower rate, it can take longer for the data to arrive.
However, the cost of this increased latency tolerance is a
reduction in IPC. A four instruction per cycle threshold
produces IPC values within 8% of the ideal memory system,
whereas a threshold of one instruction per cycle produces
IPC values up to 35% lower than ideal. Therefore, we do not
consider thresholds of one or two further in this paper.
Similarly, we omit further discussion of functional unit
utilization since the decreased latency tolerance more
than offsets the marginal increase in IPC.
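
The completion triggers discussed in this subsection can be summarized in a short sketch. The record type, field names, and function signature are assumptions made for illustration; only the thresholds (issue rate below 1, 2 or 4, or fewer than 4 busy functional units) and the rule that mispredicted branches force completion come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleStats:
    """Per-cycle processor state (hypothetical record for illustration)."""
    instructions_issued: int
    busy_functional_units: int
    # True if a mispredicted branch is waiting on an outstanding load.
    mispredicted_branch_waiting: bool


def should_force_completion(stats, issue_threshold=4, fu_threshold=None):
    """Decide whether outstanding loads should be forced to complete.

    With fu_threshold=None the issue rate metric is used (nis1, nis2,
    nis4 correspond to issue_threshold 1, 2, 4). Passing fu_threshold=4
    selects the functional unit utilization metric (fub4) instead.
    """
    if stats.mispredicted_branch_waiting:
        return True
    if fu_threshold is not None:
        return stats.busy_functional_units < fu_threshold
    return stats.instructions_issued < issue_threshold


if __name__ == "__main__":
    slow_cycle = CycleStats(instructions_issued=3,
                            busy_functional_units=6,
                            mispredicted_branch_waiting=False)
    print(should_force_completion(slow_cycle))                      # nis4
    print(should_force_completion(slow_cycle, issue_threshold=1))   # nis1
    print(should_force_completion(slow_cycle, fu_threshold=4))      # fub4
```

A lower `issue_threshold` triggers completion less often, which is consistent with the higher latency tolerance and lower IPC reported for nis1 and nis2.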

5.3  Determining  Which  Loads  to  Complete 

Now that we have determined that either mispredicted
branches attaching to a load's dependence graph or the
instruction issue rate falling below 4 should force load
completion, we focus on identifying which outstanding
load to complete. Completing the load at the head of the
LSQ (fifo) can prevent processor stalls due to the RUU/LSQ
being full and thus help alleviate the finite resource
problem. On the other hand, completing the load with the
maximum depth (in cycles) of dependent instructions (dg)
will help tackle the data dependency problem by freeing
up the most dependent instructions and thereby keep the
processor maximally utilized.

Figure 6 shows the effect of the load selection policy on
instructions per cycle and Figure 7 shows the corresponding
latency tolerance values. The figures show that both
the fifo and dg load selection policies produce almost
identical IPC numbers. However, completing loads based
on the dependence graph depth increases the latency
tolerance of loads for the floating point benchmarks. These
results provide further evidence of the variation in load
latency tolerance, and indicate that completing loads in
program order is not necessarily the "best" schedule for
exploiting latency tolerance.
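
The two selection policies can be contrasted with a small sketch. The data structures (`outstanding` as a list in program order, `dependents` as an adjacency map, `latency` as per-instruction cycle counts) and all names are hypothetical. The depth computation is one plausible reading of "maximum depth (in cycles) of dependent instructions", namely the longest latency-weighted chain of dependents.

```python
from functools import lru_cache


def dependence_depth(inst, dependents, latency):
    """Longest latency-weighted chain of instructions that depend on inst.

    Assumes the dependence graph is acyclic and that instructions
    missing from `latency` take one cycle (both are assumptions).
    """
    @lru_cache(maxsize=None)
    def depth(node):
        children = dependents.get(node, ())
        return max((latency.get(c, 1) + depth(c) for c in children),
                   default=0)

    return depth(inst)


def select_load(outstanding, policy, dependents=None, latency=None):
    """Pick the outstanding load to complete next.

    outstanding: load ids in program (LSQ) order, oldest first.
    policy:      "fifo" picks the head of the LSQ; "dg" picks the load
                 with the deepest dependence graph (ties go to the
                 older load).
    """
    if not outstanding:
        return None
    if policy == "fifo":
        return outstanding[0]
    if policy == "dg":
        dependents = dependents or {}
        latency = latency or {}
        return max(outstanding,
                   key=lambda ld: dependence_depth(ld, dependents, latency))
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    loads = ["ld_a", "ld_b"]
    graph = {"ld_a": ["add"], "ld_b": ["mul"], "mul": ["div"]}
    cycles = {"add": 1, "mul": 3, "div": 12}
    print(select_load(loads, "fifo"))                 # oldest load
    print(select_load(loads, "dg", graph, cycles))    # deepest chain
```

In this example the younger load `ld_b` feeds a longer chain of dependent work, so the dg policy completes it first even though `ld_a` is older.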

5.4  Determining  When  to  Complete  Loads 

Having decided to complete the loads with the maximum
depth of dependent instructions, we proceed to
investigate when such loads should be completed. The
load completion time controls the amount of time available
for the pipeline to fill up with ready instructions. We
study the effect of completing loads the same cycle as
detecting performance degradation (pipeline fill-up time of
zero, fut0), one cycle earlier (fut1) and two cycles earlier
(fut2) on load latency tolerance. Note that this only applies
to loads not forced to complete by a mispredicted branch.
From Figure 8, we see that IPC goes down for all
benchmarks except compress, gcc and li as we decrease the
pipeline fill-up time. These three benchmarks have a
significant number of loads completed due to mispredicted
branches; hence fill-up time has less impact. Swim shows
the highest degradation in IPC, going down from within
3% of ideal for fut2 to within 10% of ideal for fut0.
Looking at the corresponding latency tolerance graphs in
Figure 9, latency tolerance generally increases as we
decrease the fill-up time. These results match our
expectations: completing loads earlier decreases their
latency tolerance. Furthermore, the processor requires
some recovery time as results propagate down the
dependence graph and a sufficient number of instructions
become ready to execute. Using a pipeline fill-up time of 2
cycles (fut2) produces the best combination of IPC and
latency tolerance numbers.

Figure 6: Load Selection and IPC


5.5  Limiting  the  Number  of  Completed  Loads 

Finally, achieving IPCs close to that of an ideal memory
system will likely require completing more than one
load per cycle. Keeping all other parameters fixed, we
examine limits of one (nl1), two (nl2) and four (nl4) on the
number of loads that can complete in a single cycle.
Figure 10 shows the impact of these limits on IPC and
Figure 11 shows the corresponding latency tolerance
numbers. The overall trend we observe from this data is
that, as we increase the limit on the number of loads that
can complete in a cycle from 1 to 4, IPC increases and the
latency tolerance decreases. This is in line with our
expectations, since a lower limit causes some loads to
complete later than they should according to our policies,
which causes a decrease in IPC.

Figure 9: Completion Time and Latency Tolerance

Figure 10: Loads Completed and IPC
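
To make the effect of the per-cycle limit concrete, the following sketch models the loads selected for completion as a queue drained at most `limit` entries per cycle, and reports how many extra cycles each load waits. The queueing model, function name, and input format are simplifying assumptions for illustration; the simulator's actual bookkeeping is not described at this level of detail.

```python
def complete_with_limit(requests_by_cycle, limit):
    """Return {load id: extra cycles waited} under a per-cycle limit.

    requests_by_cycle[c] lists the loads that the completion policy
    wants to complete in cycle c (hypothetical input format). At most
    `limit` loads complete per cycle, oldest request first.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    pending = []  # (load id, cycle in which completion was requested)
    delay = {}
    cycle = 0
    while cycle < len(requests_by_cycle) or pending:
        if cycle < len(requests_by_cycle):
            pending.extend((ld, cycle) for ld in requests_by_cycle[cycle])
        done, pending = pending[:limit], pending[limit:]
        for ld, requested in done:
            delay[ld] = cycle - requested
        cycle += 1
    return delay


if __name__ == "__main__":
    wanted = [["a", "b", "c"], ["d"], []]
    for limit in (1, 2, 4):  # nl1, nl2, nl4
        print(limit, complete_with_limit(wanted, limit))
```

With a limit of one, loads `b`, `c` and `d` complete later than the policy requested, which raises their measured latency tolerance at the cost of delaying their dependents; with a limit of four, no load waits.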

5.6  Effects  of  Processor  Architecture 

To evaluate the impact of various microarchitectural
changes on our measurements, we evaluated a configuration
with 128 RUU entries and 64 LSQ entries, and a four
issue processor for both the 128/64 and 256/128 RUU/LSQ
configurations. In the case of the four issue processor,
we used an issue rate threshold of three instructions
per cycle and allowed up to three loads to complete in a
single cycle. Figure 12 shows the effect of various
processor configurations on IPC and Figure 13 shows the
corresponding latency tolerance numbers. The graphs are
labeled according to issue-width/RUU entries/LSQ entries.

Figure 11: Loads Completed and Latency Tolerance

The first observation from this data is that the issue
width has a much larger impact on IPC than buffer space
does. We note that all the IPC values shown are within
11% of the corresponding ideal memory system. Also, we
see that the IPC values are mostly independent of the
number of RUU/LSQ entries. However, the situation is very
different with respect to the amount of latency tolerance.
From Figure 13, we see that latency tolerance increases
when either the issue width decreases or the number of
RUU/LSQ entries increases. The floating point programs
exhibit larger increases than the integer programs. The
most striking change is for swim with the 4/256/128
configuration: nearly all its memory references can
complete in the 32 cycle limit.

Figure 12: Processor Architecture and IPC
(Issue-width/RUU-size/LSQ-size)

Figure 13: Processor Architecture and Latency
Tolerance (Issue-width/RUU-size/LSQ-size)


5.7  Traditional  Memory  Hierarchies 

The results presented thus far indicate that programs do
exhibit variations in load latency tolerance. We now
evaluate how well traditional memory systems meet an
application's inherent latency demands. To determine this,
we track which level in a traditional memory hierarchy
satisfies each load and increment a counter for the
corresponding measured latency tolerance. This produces a
histogram for each level in the memory hierarchy, and
allows us to evaluate the memory hierarchy's effectiveness
with varying access times for each level. For example, if
the L2 access time is 8 cycles, then to avoid performance
degradation, loads with measured latency tolerance less
than 8 cycles should be satisfied by the L1 cache. Similarly,
loads with measured latency tolerance greater than or equal
to 8 cycles, but less than main memory access time, could be
satisfied by the L2 cache. Clearly, we can perform this
computation for arbitrary access times. Furthermore, we can
track the discrepancy between where a load should be
satisfied and where it is actually satisfied in the memory
hierarchy.
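
The bookkeeping described above can be sketched as follows. The function names, the level labels, and the input format are illustrative assumptions. The 8-cycle L2 access time is the example used in the text, while the 32-cycle main memory access time is an assumed placeholder value.

```python
from collections import Counter


def ideal_level(tolerance, l2_time=8, mem_time=32):
    """Slowest level that can still satisfy a load with this tolerance."""
    if tolerance < l2_time:
        return "L1"
    if tolerance < mem_time:
        return "L2"
    return "MEM"


def placement_histogram(loads, l2_time=8, mem_time=32):
    """Count loads by (level that actually satisfied it, ideal level).

    loads: iterable of (actual_level, measured_tolerance_in_cycles),
           with actual_level one of "LSQ", "L1", "L2", "MEM".
    """
    hist = Counter()
    for actual, tolerance in loads:
        hist[(actual, ideal_level(tolerance, l2_time, mem_time))] += 1
    return hist


def mismatch_fractions(loads, l2_time=8, mem_time=32):
    """Return (low-latency L1 misses, latency-tolerant L1 hits) as fractions."""
    hist = placement_histogram(loads, l2_time, mem_time)
    total = sum(hist.values())
    if total == 0:
        raise ValueError("need at least one load")
    low_latency_misses = sum(n for (actual, ideal), n in hist.items()
                             if ideal == "L1" and actual in ("L2", "MEM"))
    tolerant_hits = sum(n for (actual, ideal), n in hist.items()
                        if actual == "L1" and ideal != "L1")
    return low_latency_misses / total, tolerant_hits / total


if __name__ == "__main__":
    trace = [("L1", 2), ("L2", 3), ("L1", 12), ("L2", 20), ("MEM", 5)]
    print(placement_histogram(trace))
    print(mismatch_fractions(trace))
```

Because the thresholds are parameters, the same trace can be re-evaluated for arbitrary access times, for example by passing `l2_time=6`.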

We performed this analysis on 8KB, 16KB, and 32KB,
direct-mapped and two-way set-associative L1 caches with
32-byte blocks, using the base L2 cache configuration
(1MB, 64-byte blocks). For brevity, we report results only
for the direct-mapped caches and an L2 access time of 8
cycles on the 8-issue processor with 256 RUU entries and
128 LSQ entries. Table 1 shows the effectiveness of the
traditional two-level memory hierarchies at capturing
latency tolerance. The top row for each benchmark in the
table indicates the percentage of loads satisfied by a
particular level in the memory hierarchy with measured
latency tolerance less than 8 cycles. Similarly, the bottom
row for each benchmark corresponds to loads with measured
latency tolerance greater than or equal to 8 cycles. The
levels of the memory hierarchy are the load/store queue
(LSQ), L1 cache, L2 cache, and main memory.

We focus our discussion on the L1 and L2 caches. In
particular, the number of low latency loads (< 8 cycles) not
satisfied by the L1 cache and the number of high latency
loads (>= 8 cycles) satisfied by the L1 indicate the
mismatch between the application's latency demands and the
latency incurred in the memory hierarchy. From these
results we see that some benchmarks exhibit significant
discrepancy between their latency demands and the
latency incurred in a real memory hierarchy. Consider
the 16KB cache for swim: 16% of loads require low
latency but are satisfied by the L2 cache, whereas 29% of
loads have enough latency tolerance and are L1 cache hits.
Ideally, those references should be swapped, with the L1
cache satisfying the low latency loads and the L2
satisfying the high latency loads.

Compress is another striking example, with 16% to
22% of its loads requiring low latency but missing in the
L1 cache. However, we note that compress has very few
high latency loads for this processor configuration. In
contrast, for swim 2% to 36% of loads require low latency
yet miss in the L1 cache, whereas 13% to 37% of loads are
high latency and hit in the L1. Finally, for the floating
point benchmarks a noticeable fraction (1%-3%) of loads
require low latency but are satisfied by main memory. As
the disparity between processor cycle time and main
memory access time increases, even this small fraction of
references can dramatically reduce overall performance.

Table 1: Effectiveness of Traditional Memory System at Capturing Latency Tolerance

The mismatch between an application's latency
demands and the actual latency is dependent on the
performance of the real memory hierarchy. In general, we see
that reducing the L1 cache size increases the discrepancy.
The floating point benchmarks show dramatic increases as
the cache size is reduced. Swim and tomcatv go from
2% low latency L1 load misses in the 32KB cache to 36%
and 22%, respectively, in the 8KB cache. Although not
shown, reducing the boundary from 8 cycles to 6 cycles
can increase the number of low latency load misses in the
L1 cache. Also, our results (not shown) indicate that
increasing the L1 associativity has very little effect on the
match between the application's latency demands and the
memory hierarchy's performance, when compared to the
direct-mapped caches.

6  Conclusion 

This paper explores latency tolerance in dynamically
scheduled processors. Our two primary contributions are a
quantitative evaluation of applications' inherent latency
tolerance in dynamically scheduled processors, and an
analysis of how well a conventional memory hierarchy meets
the application's latency demands. We compute latency
tolerance by measuring the number of cycles a load could
take to complete without adversely affecting performance
compared to an ideal memory system where all loads
complete in one cycle.

Our measurements show that load latency tolerance is a
function of the number and type of dependent instructions.
In particular, mispredicted branches have a significant
impact on measured latency tolerance for the integer
benchmarks. We also observe that most programs do exhibit
some latency tolerance, and still obtain IPC values
comparable to an ideal memory system. Our results show that
between 1% and 62% of loads must complete in one cycle, and
between 5% and 98% must complete within 8 cycles,
depending on processor configuration.

We show that for some benchmarks, a significant
number of loads could be satisfied in latencies on the order
of second level cache access times, while others must be
satisfied by the first level cache. Unfortunately, this
discrepancy in latency tolerance is ignored by conventional
memory hierarchies that always fetch data into the primary
cache. We plan to investigate methods for utilizing latency
tolerance information in memory hierarchy management.
Prefetching [11] is clearly one avenue for exploiting this
information. Alternatively, we could place data in the
memory hierarchy according to the corresponding load's
latency tolerance and bypass higher levels of the memory
hierarchy [1,7,20], or prioritize requests in a system that
supports multiple outstanding misses.